Propagate min/max/string length statistics to duckdb#7416
Propagate min/max/string length statistics to duckdb#7416
Conversation
5848867 to
c4a56e6
Compare
c4a56e6 to
69e130d
Compare
Polar Signals Profiling ResultsLatest Run
Previous Runs (3)
Powered by Polar Signals Cloud |
Benchmarks: TPC-DS SF=1 on NVMEVerdict: No clear signal (environment too noisy confidence) datafusion / vortex-file-compressed (1.234x ❌, 0↑ 97↓)
datafusion / vortex-compact (1.173x ❌, 0↑ 94↓)
datafusion / parquet (1.188x ❌, 0↑ 97↓)
duckdb / vortex-file-compressed (1.397x ❌, 0↑ 86↓)
duckdb / vortex-compact (1.493x ❌, 3↑ 87↓)
duckdb / parquet (1.297x ❌, 3↑ 85↓)
duckdb / duckdb (1.131x ❌, 2↑ 70↓)
Full attributed analysis
|
File Sizes: TPC-DS SF=1 on NVMEFile Size Changes (48 files changed, +0.0% overall, 48↑ 0↓)
Totals:
|
e2e1bbe to
46c5eaf
Compare
File Sizes: Statistical and Population GeneticsFile Size Changes (2 files changed, +0.0% overall, 2↑ 0↓)
Totals:
|
File Sizes: Clickbench on NVMEFile Size Changes (201 files changed, -0.0% overall, 200↑ 1↓)
Totals:
|
Benchmarks: Random AccessVortex (geomean): 0.915x ➖ unknown / unknown (1.000x ➖, 6↑ 3↓)
|
Benchmarks: CompressionVortex (geomean): 1.012x ➖ unknown / unknown (1.027x ➖, 1↑ 7↓)
|
46c5eaf to
276db60
Compare
File Sizes: TPC-H SF=1 on NVMEFile Size Changes (18 files changed, +0.0% overall, 18↑ 0↓)
Totals:
|
File Sizes: TPC-H SF=10 on NVMEFile Size Changes (48 files changed, +0.0% overall, 48↑ 0↓)
Totals:
|
|
Looks like some real regressions here. Opening all the files AoT is not a good idea, its very expensive. |
gatesn
left a comment
There was a problem hiding this comment.
I think the regression might be eagerly merging all stats for all columns, when we only need min/max/str_len for columns that DuckDB asks us about
| column_index: usize, | ||
| ) -> &'a ColumnStatistics { | ||
| match &bind_data.stats { | ||
| DataSourceStatistics::All(items) => &items[column_index], |
There was a problem hiding this comment.
We should lazily construct stats for each column. Could be way fewer columns that DuckDB asks for
| self.statistics.as_ref() | ||
| } | ||
|
|
||
| pub fn take_statistics(&mut self) -> Option<FileStatistics> { |
| self.footer.statistics() | ||
| } | ||
|
|
||
| pub fn take_file_stats(&mut self) -> Option<FileStatistics> { |
Merging this PR will improve performance by 21.19%
Performance Changes
Comparing Footnotes
|
Signed-off-by: Mikhail Kot <to@myrrc.dev>
fd7c623 to
158c527
Compare
|
statistics isn't the right function here as parquet only reports it for a single file (I guess they also tried eager readers first). I'll keep it, but also we need get_partition_stats with cached metadata |
Uh oh!
There was an error while loading. Please reload this page.